Infectious Diseases in Germany by Stefan Borchardt

Univariate Plots Section

Iteration 1


I use data on infectious diseases collected by the Robert Koch Institute in Germany as an official statistic. From the various customizable selections available, I chose, with the help of a doctor, the incidence of 14 diseases which can cause stomach flu or diarrhea. My idea is that a patient’s characteristics can help to point out the most likely causes of similar symptoms.

Because I selected which data to include through the interface at https://survstat.rki.de , I already have some understanding of the structure of the data. On the other hand, because I did the data wrangling myself, I have to check that joins and value transformations worked as intended.

First, I display some textual summaries:

## 'data.frame':    12416 obs. of  16 variables:
##  $ date         : Date, format: "2001-01-01" "2001-01-08" ...
##  $ age          : Factor w/ 16 levels "A00..00","A01..01",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ campylobacter: num  0.29 0.44 0.44 1.6 0.44 0.58 0.29 0.15 0.73 0.44 ...
##  $ ecoli        : num  0.44 0.73 2.03 3.34 3.2 1.74 2.91 1.74 2.32 2.76 ...
##  $ ehec         : num  0 0.44 0.29 0 0.15 0 0.15 0 0.15 0.29 ...
##  $ giardia      : num  0 0 0 0 0.15 0 0 0.29 0 0.44 ...
##  $ hus          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ influenza    : num  0 0 0 0.15 0 0.15 0 0.29 0.58 0.15 ...
##  $ legionella   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ meningokokken: num  0 0.44 0.15 0 0 0 0.44 0.29 0.15 0.15 ...
##  $ norovirus    : num  0 0 0 0.15 0.29 0.15 0.29 0.44 0.44 1.02 ...
##  $ rotavirus    : num  9.59 20.92 36.18 44.17 44.61 ...
##  $ salmonella   : num  1.45 1.89 2.03 2.32 1.6 1.31 1.02 1.31 2.03 1.45 ...
##  $ shigella     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ typhus       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ yersiniosis  : num  0.15 0.15 0.15 0 0 0.44 0 0 0 0.15 ...
##       date                 age       campylobacter       ecoli        
##  Min.   :2001-01-01   A00..00: 776   Min.   :0.000   Min.   : 0.0000  
##  1st Qu.:2004-09-21   A01..01: 776   1st Qu.:0.940   1st Qu.: 0.0300  
##  Median :2008-06-06   A02..02: 776   Median :1.430   Median : 0.0700  
##  Mean   :2008-06-05   A03..03: 776   Mean   :1.681   Mean   : 0.7247  
##  3rd Qu.:2012-02-20   A04..04: 776   3rd Qu.:2.150   3rd Qu.: 0.4200  
##  Max.   :2015-11-12   A05..09: 776   Max.   :8.740   Max.   :14.3400  
##                       (Other):7760                                    
##       ehec           giardia           hus             influenza      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000000   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:0.040   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median :0.0200   Median :0.090   Median :0.000000   Median :  0.000  
##  Mean   :0.1024   Mean   :0.116   Mean   :0.009513   Mean   :  1.366  
##  3rd Qu.:0.1100   3rd Qu.:0.140   3rd Qu.:0.000000   3rd Qu.:  0.260  
##  Max.   :2.2900   Max.   :1.570   Max.   :0.900000   Max.   :222.650  
##                                                                       
##    legionella      meningokokken       norovirus         rotavirus      
##  Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000   Min.   :  0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.3775   1st Qu.:  0.180  
##  Median :0.00000   Median :0.00000   Median : 1.0100   Median :  0.510  
##  Mean   :0.00852   Mean   :0.03674   Mean   : 3.6344   Mean   :  6.128  
##  3rd Qu.:0.01000   3rd Qu.:0.02000   3rd Qu.: 2.9500   3rd Qu.:  3.330  
##  Max.   :0.35000   Max.   :1.31000   Max.   :78.1300   Max.   :184.350  
##                                                                         
##    salmonella        shigella          typhus          yersiniosis   
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.000000   Min.   :0.000  
##  1st Qu.: 0.500   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.030  
##  Median : 1.000   Median :0.0000   Median :0.000000   Median :0.090  
##  Mean   : 2.052   Mean   :0.0208   Mean   :0.001782   Mean   :0.323  
##  3rd Qu.: 2.280   3rd Qu.:0.0200   3rd Qu.:0.000000   3rd Qu.:0.350  
##  Max.   :23.370   Max.   :1.0700   Max.   :0.150000   Max.   :4.300  
## 
##  [1] "A00..00" "A01..01" "A02..02" "A03..03" "A04..04" "A05..09" "A10..14"
##  [8] "A15..19" "A20..24" "A25..29" "A30..39" "A40..49" "A50..59" "A60..69"
## [15] "A70..79" "A80."

The levels of the factor age are not evenly spaced; younger ages have finer granularity.
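The uneven spacing can be read directly from the level names. The following is a minimal sketch in base R, assuming that the two numbers in a level name are the lower and upper age bound; the helper variables `lower`, `upper` and `width` are made up for this illustration.

```r
# A few of the age levels listed above.
age_levels <- c("A00..00", "A01..01", "A05..09", "A30..39", "A80.")

# Assumption: characters 2-3 hold the lower bound, characters 6-7 the upper bound.
lower <- as.integer(substr(age_levels, 2, 3))
# The open-ended last group "A80." has no upper bound and becomes NA.
upper <- suppressWarnings(as.integer(substr(age_levels, 6, 7)))

# Width of each group in years.
width <- upper - lower + 1L
data.frame(age_levels, lower, upper, width)
```

The widths grow from one year for small children to ten years for adults, which matters whenever counts per level are compared.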

To see if my data wrangling is plausible I use the library hts to plot a grouped time series:

Some age groups seem to be more prone to these infectious diseases and there are seasonal patterns.

Univariate Analysis

Iteration 1

What is the structure of your dataset?

The incidence of the diseases, as cases per 100,000, is reported per week for the last 15 years and per age group of the patients. After combining 14 separate files, the dataset contains 12,416 observations of 16 variables: the date, the age group, and the incidence of each of the 14 diseases. I filled in zeros where necessary, so that all age groups and diseases are present for every week, which helps with analyzing time series at later stages. The patient’s age is given in one-year groups for children under 5, in five-year groups for ages 5 to 29 and in ten-year groups for ages 30 to 79. People aged 80 and over form the last group.
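The zero-filling step could look roughly like the sketch below. It is an illustration in base R, not the code used for this report; the data frame `cases` and its values are made up for the example.

```r
# Hypothetical long-format data with a gap: the second week is missing
# for age group A01..01.
cases <- data.frame(
  date      = as.Date(c("2001-01-01", "2001-01-08", "2001-01-01")),
  age       = c("A00..00", "A00..00", "A01..01"),
  incidence = c(0.29, 0.44, 0.15),
  stringsAsFactors = FALSE
)

# Build every combination of week and age group ...
grid <- expand.grid(
  date = sort(unique(cases$date)),
  age  = sort(unique(cases$age)),
  stringsAsFactors = FALSE
)

# ... and left-join the observed values onto it.
filled <- merge(grid, cases, by = c("date", "age"), all.x = TRUE)

# Combinations without a report get an incidence of zero.
filled$incidence[is.na(filled$incidence)] <- 0
filled
```

A complete grid like this gives every age group a series of equal length, which most time series functions expect.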

What is/are the main feature(s) of interest in your dataset?

I’d like to know which infectious diseases are the most likely ones to check for when a patient sees a doctor with gastrointestinal symptoms. From a first look, three of the diseases (HUS, legionella and typhus) have a maximum incidence below 1 per 100,000, whereas norovirus, salmonella and campylobacter have medians of 1 per 100,000 or more.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I assume the age of the patient and the time of the year to be other important factors. For a first glimpse, I plotted the dataset as a time series using hts, which revealed a seasonal pattern for some diseases and differences between the age groups.

Did you create any new variables from existing variables in the dataset?

No, existing values were only transformed.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The interface I had to extract the data from, SurvStat@RKI 2.0 https://survstat.rki.de , allows downloading only two dimensions at once, for which I chose the patient age group and the week. The 14 data files typically had 16 columns and 731 rows each. I combined these files into one by joining the separate data frames after I had changed the data format of some columns. Finally, I melted the combined data into one set of 12,416 observations of 16 variables.
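To illustrate the kind of join and reshape described here, a minimal sketch in base R follows. The two small tables, the helper `to_long` and the column name `week` are hypothetical, and the order of the steps (reshape first, then join) is a simplification that may differ from what was actually done.

```r
# Two hypothetical per-disease tables as downloaded: one row per week,
# one column per age group (the real files had 16 columns).
noro <- data.frame(week = c("2001-01", "2001-02"),
                   A00  = c(0.00, 0.15),
                   A01  = c(0.29, 0.44),
                   stringsAsFactors = FALSE)
rota <- data.frame(week = c("2001-01", "2001-02"),
                   A00  = c(9.59, 20.92),
                   A01  = c(4.89, 1.61),
                   stringsAsFactors = FALSE)

# Turn one wide table into long format: one row per week and age group,
# with the incidence in a column named after the disease.
to_long <- function(wide, disease) {
  age_cols <- setdiff(names(wide), "week")
  long <- data.frame(
    week = rep(wide$week, times = length(age_cols)),
    age  = rep(age_cols, each = nrow(wide)),
    stringsAsFactors = FALSE
  )
  long[[disease]] <- unlist(wide[age_cols], use.names = FALSE)
  long
}

tables      <- list(noro = noro, rota = rota)
long_tables <- Map(to_long, tables, names(tables))

# Join all diseases on week and age group. all = TRUE keeps unmatched rows,
# so a failed join shows up as NA instead of silently dropping data.
combined <- Reduce(function(x, y) merge(x, y, by = c("week", "age"), all = TRUE),
                   long_tables)

# Sanity checks for the join: expected row count and no missing values.
stopifnot(nrow(combined) == nrow(noro) * 2, !anyNA(combined))
combined
```

Checking the row count and the absence of `NA` right after the join is a cheap way to catch mismatched keys, for instance differently formatted week labels.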

Univariate Plots Section

Iteration 2

The first look at the data was so promising that I decided to increase the number of variables I take into consideration. Unfortunately, the limitations of the interface required me to download a total of 72 files to additionally include gender and region of the patients. I removed the five rarest diseases of iteration 1.

Again, I check the structure and get an overview:

## 'data.frame':    99456 obs. of  14 variables:
##  $ date  : Date, format: "2001-01-01" "2001-01-01" ...
##  $ age   : Factor w/ 16 levels "A00","A01","A02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ camp  : num  0 0 0 4.75 0 0 0 0.51 0.48 1.64 ...
##  $ week  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ecol  : num  1.66 0 0 0 0 0 0 0 0 0 ...
##  $ ehec  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ giar  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ infl  : num  0 0 0 0 0 0 0.27 0 0 0 ...
##  $ noro  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rota  : num  8.28 4.89 1.61 1.58 0 0.6 0.27 0.26 0.24 0 ...
##  $ salm  : num  0 0 0 3.17 1.56 1.49 0 0.26 0 0.47 ...
##  $ yers  : num  0 0 0 0 0 0.3 0.55 0 0 0 ...
##  $ gender: Factor w/ 2 levels "fem","mal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ region: Factor w/ 4 levels "e","n","s","w": 2 2 2 2 2 2 2 2 2 2 ...
##       date                 age             camp            week      
##  Min.   :2001-01-01   A00    : 6216   Min.   : 0.00   Min.   : 1.00  
##  1st Qu.:2004-09-23   A01    : 6216   1st Qu.: 0.76   1st Qu.:13.00  
##  Median :2008-06-10   A02    : 6216   Median : 1.41   Median :26.00  
##  Mean   :2008-06-09   A03    : 6216   Mean   : 1.79   Mean   :26.42  
##  3rd Qu.:2012-02-26   A04    : 6216   3rd Qu.: 2.29   3rd Qu.:39.00  
##  Max.   :2015-11-19   A05    : 6216   Max.   :24.81   Max.   :53.00  
##                       (Other):62160                                  
##       ecol              ehec              giar             infl        
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   :  0.000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:  0.000  
##  Median : 0.0000   Median : 0.0000   Median :0.0000   Median :  0.000  
##  Mean   : 0.8432   Mean   : 0.1091   Mean   :0.1218   Mean   :  1.539  
##  3rd Qu.: 0.2600   3rd Qu.: 0.0000   3rd Qu.:0.1200   3rd Qu.:  0.120  
##  Max.   :50.4400   Max.   :10.4400   Max.   :8.2700   Max.   :319.190  
##                                                                        
##       noro             rota              salm             yers        
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.:  0.21   1st Qu.:  0.120   1st Qu.: 0.400   1st Qu.: 0.0000  
##  Median :  0.93   Median :  0.480   Median : 0.930   Median : 0.0000  
##  Mean   :  4.16   Mean   :  6.902   Mean   : 2.193   Mean   : 0.3944  
##  3rd Qu.:  3.10   3rd Qu.:  3.170   3rd Qu.: 2.290   3rd Qu.: 0.2000  
##  Max.   :284.53   Max.   :466.500   Max.   :56.750   Max.   :24.6100  
##                                                                       
##  gender      region   
##  fem:49728   e:24864  
##  mal:49728   n:24864  
##              s:24864  
##              w:24864  
##                       
##                       
## 

Also, I plot a time series to see that the data wrangling worked.

The picture is the same as before; adding only gender is a bit disappointing. But there are differences between the regions.

A look into the distribution of incidences, also experimenting on equidistant breaks on log scale:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
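Breaks that are equidistant on a log scale can be built by exponentiating an evenly spaced sequence. The sketch below uses base R and invented incidence values; it only shows the idea and is not the plotting code of this report.

```r
# Hypothetical incidence values spanning several orders of magnitude.
incidence <- c(0.02, 0.15, 0.44, 1.6, 9.59, 44.17, 222.65)

# Break points that are equidistant on a log10 scale, from 0.01 to 1000.
breaks <- 10^seq(-2, 3, by = 1)

# Count the values per bin. Zeros would have to be dropped or offset first,
# because log10(0) is -Inf and falls outside every bin.
bins <- cut(incidence, breaks = breaks)
table(bins)
```

The same vector of breaks can be passed to a histogram so that each bar covers one order of magnitude.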

At first I was surprised that the occurrence of diseases seemed evenly distributed over all ages; then I recoded the factor levels:

Kids are more prone to infectious diseases, it seems.

It is hard to compare regions by the number of disease occurrences, so I gradually increased the threshold with the help of the incidence histogram from above:

High incidences seem to occur more often in the East.

I use the same approach with the diseases:

I got the impression that only 3-5 diseases reach high incidences, so I will have a look at the variability later in the bivariate section.

I remember seasonal patterns from the initial time series plot. So this time I group by week of the year:

Because the data is current (updated every Wednesday), I have to consider that the values for December are still lower at this time. The numbers seem to show that the media attention that influenza usually gets should actually be on rotavirus and norovirus. Salmonella and campylobacter have a moderate peak in summer, but much higher incidences can be seen for the two viruses in winter and spring.

Are there long-term trends? I resample the time series as quarterly and yearly values for the disease incidences in order to smooth:
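A minimal sketch of such a resampling in base R is shown below. The weekly series is synthetic, and averaging within each period is an assumption made for the illustration.

```r
# Hypothetical weekly incidences for one disease over two years.
weekly <- data.frame(
  date = seq(as.Date("2001-01-01"), by = "week", length.out = 104)
)
weekly$incidence <- 2 + sin(2 * pi * seq_len(104) / 52)

# Derive the year and a year-quarter label from the date.
weekly$year    <- as.integer(format(weekly$date, "%Y"))
weekly$quarter <- paste(weekly$year, quarters(weekly$date))

# Average the weekly values within each period to smooth the series.
quarterly <- aggregate(incidence ~ quarter, data = weekly, FUN = mean)
yearly    <- aggregate(incidence ~ year,    data = weekly, FUN = mean)
```

Replacing `mean` with `sum` gives the cumulative incidence per period instead, which is only meaningful if the underlying population stays roughly constant within the period.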

Norovirus is on the rise while rotavirus infections are declining. Immunization shots against rotavirus have been available since 2006 and became officially recommended (and paid for) in 2013.

I have to consider that these are the cases that were reported to the Robert Koch Institute. The inclination to report the incidents might be influenced by, for instance, the amount of paperwork required (2 pages) or the awareness or diagnostic capabilities of the doctors.

I looked for an additional data source and found data on gastroenteritis from a health insurer. It contains the cases diagnosed as non-infectious gastroenteritis which caused sick leave in the year 2005. That is the only year for which membership data is available, so that I can approximate the incidence for the total population:

With the help of a doctor I found out that at least 80% of the cases have an infectious cause, which means I should see a yearly incidence of well above 10,000 in the RKI’s data. The plotted values are much lower. I checked this against the yearly values through the institute’s interface and found even lower values (240-560) there.

I think the reason for the different values is that some incidences were summed where a mean should have been calculated. For instance, an incidence of 4 in Region East and 6 in Region North does not mean a combined incidence of 10, but of 5 if both regions have the same population (otherwise it is the population-weighted mean). On the other hand, incidences for diseases and ages should be added.
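The arithmetic behind this can be checked with a small example. The sketch below is in base R with invented case counts and populations; it shows that the combined incidence is the population-weighted mean of the regional incidences, not their sum.

```r
# Hypothetical case counts and populations for two regions.
regions <- data.frame(
  region     = c("e", "n"),
  cases      = c(40, 120),
  population = c(1e6, 2e6)
)

# Incidence per region as cases per 100,000 inhabitants (here 4 and 6).
regions$incidence <- regions$cases / regions$population * 1e5

# Summing the regional incidences overstates the combined value ...
summed <- sum(regions$incidence)

# ... because the combined incidence divides all cases by the whole population.
combined <- sum(regions$cases) / sum(regions$population) * 1e5

# This equals the population-weighted mean of the regional incidences.
weighted <- weighted.mean(regions$incidence, w = regions$population)

c(summed = summed, combined = combined, weighted = weighted)
```

With equal populations the weighted mean reduces to the plain mean, which is the value of 5 in the example from the text.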

I’ll start over with separate values for cases and population.

Univariate Analysis

Iteration 2

What is the structure of your dataset?

The dataset contains 99,456 observations of 14 variables. For 777 weeks I have the incidence (as cases per 100,000) of nine infectious diseases that are not very rare and are related to gastrointestinal symptoms, for patients grouped by age, gender and region of Germany. Again, the observations have been filled with zeros to help with time series analysis later on. A redundant column for the week of the year has been included. The regions are North, East, South and West, all of which contain bigger cities and rural areas. Data with unclear gender was already omitted when downloading the data. The size of the dataset is approximately 6 MB.

What is/are the main feature(s) of interest in your dataset?

I’d like to know which infectious diseases are the most likely ones to check for when a patient sees a doctor with gastrointestinal symptoms. From a first look, gender does not seem to be very important, but age seems to make a difference.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I assume the region of the patient and the time of the year to be other important factors. My time series plots for validation hinted at these relationships.

Did you create any new variables from existing variables in the dataset?

I temporarily created a variable for age with factor levels of equal width.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As before, I had to combine several files into one. This time though, I shortened the names of factor levels to a fixed length to make use of the automatic grouping of the hts library.

Univariate Plots Section

Iteration 3

This time I am going to calculate the incidence myself, because the data source does not clearly document which numbers its calculation is based on, and nobody there was available for questions. First, I try to replicate the last plot:

## 'data.frame':    2283228 obs. of  7 variables:
##  $ date      : POSIXct, format: "2001-01-01" "2001-01-01" ...
##  $ disease   : Factor w/ 9 levels "camp","ecol",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ week      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pop_year  : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ age       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ case_count: int  1 2 1 5 0 1 0 0 0 0 ...
##  $ region    : Factor w/ 4 levels "e","n","s","w": 2 2 2 2 2 2 2 2 2 2 ...

The numbers now match the values I got from the data source. In a perfect world, the green and black lines would be much higher than the blue line.

This also means that I have to check the values of other plots I consider for my final plots. Here is the update for the long-term trends of diseases:

## Using incidence as value column: use value.var to override.

I’d like to try out some forecasting models. I start with the package hts and use a TBATS model from the package forecast. This model represents seasonality with trigonometric (Fourier) terms and uses exponential smoothing to fit the time series.

## Using incidence as value column: use value.var to override.

Next is package season, with which I fit a non-stationary cosinor model (sinusoid smoothing) to norovirus only.

## Iteration number 50 of 100 
## Iteration number 100 of 100
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
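For comparison, the basic idea of a cosinor model can be sketched in base R as a linear regression on one sine and one cosine term. This is a plain stationary cosinor with a fixed 52-week period, not the non-stationary model from the package season, and the series below is simulated.

```r
# Hypothetical weekly incidence: one yearly cycle peaking in week 5, plus noise.
set.seed(1)
week_idx <- seq_len(156)                       # three years of weekly data
y <- 3 + 2 * cos(2 * pi * (week_idx - 5) / 52) +
  rnorm(length(week_idx), sd = 0.3)

# Stationary cosinor: regress on a cosine and a sine with a fixed period
# (52 weeks is an assumption of this sketch).
fit <- lm(y ~ cos(2 * pi * week_idx / 52) + sin(2 * pi * week_idx / 52))

# Amplitude and peak week follow from the two slope coefficients.
b_cos <- unname(coef(fit)[2])
b_sin <- unname(coef(fit)[3])
amplitude <- sqrt(b_cos^2 + b_sin^2)
peak_week <- (atan2(b_sin, b_cos) * 52 / (2 * pi)) %% 52

c(amplitude = amplitude, peak_week = peak_week)
```

A non-stationary variant additionally lets amplitude and phase drift over time, which is why it needs an iterative fit.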

A frequently used library is forecast, which supplied the TBATS model for hts.

The models identify a seasonality, but the forecast confidence interval is rather wide. I will stick with a simple plot:

## Warning: Removed 15 rows containing missing values (geom_area).

This is going to be a final plot.

Univariate Analysis

Iteration 3

What is the structure of your dataset?

The dataset consists of two parts. To compute the incidences on my own, I downloaded the population data of the German states from https://www-genesis.destatis.de/genesis/online/link/tabellen/12411-0011 . There are limitations on the size of data you can obtain for free, so it is only biennial. Initially, this first part of the data contained 736 obs. of 18 variables, but I reduced it to 648 obs. of 6 variables by summing values for regions and ages 80+.

The second part contains 2,283,228 observations of 7 variables. For 778 weeks I have the case counts of nine not very rare infectious diseases, which are related to gastrointestinal symptoms, for patients grouped by age and region of Germany. Two redundant columns for the week of the year and the year to match with the population data have been included. The size of this dataset is approximately 80 MB.

Did you create any new variables from existing variables in the dataset?

The incidence was calculated, often per plot, to ensure the right case and population counts are matched against each other.
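A minimal sketch of such a per-plot calculation in base R follows. The column names `case_count`, `pop_year`, `region`, `age` and `pop_count` follow the output shown in this report; the values and the aggregation level (per year over all regions) are invented for the illustration.

```r
# Hypothetical case counts per population year, region and age.
cases <- data.frame(
  pop_year   = c(2000L, 2000L, 2002L),
  region     = c("n", "e", "n"),
  age        = c(3L, 3L, 3L),
  case_count = c(5L, 2L, 4L)
)

# Hypothetical population counts for the same keys.
population <- data.frame(
  pop_year  = c(2000L, 2000L, 2002L),
  region    = c("n", "e", "n"),
  age       = c(3L, 3L, 3L),
  pop_count = c(100000L, 50000L, 80000L)
)

# Aggregate cases and population separately to the level of the plot
# (here: per year over all regions), then divide. Dividing first and
# summing the incidences afterwards would give values that are too high.
cases_by_year <- aggregate(case_count ~ pop_year, data = cases, FUN = sum)
pop_by_year   <- aggregate(pop_count ~ pop_year, data = population, FUN = sum)

incidence <- merge(cases_by_year, pop_by_year, by = "pop_year")
incidence$incidence <- incidence$case_count / incidence$pop_count * 1e5
incidence
```

Keeping counts and population apart until the last step means the same two tables can serve any grouping a plot needs.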

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

To give an example, the age was included in the form “6 year olds” or “over 90 year olds” in the population data. I had to extract the numeric value and, because the disease data grouped everyone over 80 together, had to sum all ages above 80. Additionally, I had to sum the values for the various states of Germany to obtain population numbers for the regions.
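The age extraction could be sketched as follows in base R. The labels use the English wording quoted above, which may differ from the labels in the downloaded file, and the population counts are invented.

```r
# Hypothetical age labels in the form described above.
population <- data.frame(
  age_label = c("6 year olds", "79 year olds", "80 year olds",
                "85 year olds", "over 90 year olds"),
  pop_count = c(700L, 500L, 400L, 300L, 200L),
  stringsAsFactors = FALSE
)

# Extract the first run of digits from each label.
population$age <- as.integer(
  sub("^[^0-9]*([0-9]+).*$", "\\1", population$age_label)
)

# Cap ages at 80 so that everyone aged 80 or older falls into one group,
# matching the top age group of the disease data.
population$age <- pmin(population$age, 80L)

# Sum the population within each (capped) age.
by_age <- aggregate(pop_count ~ age, data = population, FUN = sum)
by_age
```

The same aggregation pattern, with a lookup from state to region in place of the age cap, would cover the regional sums.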

In terms of unusual distributions, when I look at the quarterly incidences, there is a spike in EHEC diseases, and E. coli drops to 0 in 2015.

Bivariate Plots Section

First, I revisit the disease distribution by age, only this time incidences are not counted but summed up.

And as a windrose plot:

And finally with dot size representing incidence, vaguely reminiscent of a violin plot:

I think the differences between campylobacter, norovirus and salmonella on the one hand and rotavirus on the other are much clearer when plotted this way. This could be the basis for a final plot.

I had not expected the differences to be that big. Again, I have to consider that the data contains reported cases; maybe adults just do not see a doctor when they have the symptoms.

All plots show that only 3 to 5 diseases seem to be responsible for the majority of cases, which is also something I wanted to investigate earlier in the univariate section. Here is the variability of the monthly incidences of each disease in Germany for all ages.

## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

Influenza has a low mean but quite a few outliers, up to 109. Together with the long-term plots from above I conclude that influenza has a rather short season, while rota- and norovirus stay for a longer time of the year.

Next, I’m going to explore how region and disease are related.

Region East shows higher incidences in both plots; maybe this is caused by a one-time event. Additionally, I can see the regional outbreak of EHEC in the East in 2011.

The upper plots show incidences in the east and the rest, the lower plot the difference between them. Region East has higher incidences for all diseases in almost every single month. Are doctors there more eager to report or is this caused by the age structure?

## Using pop_count as value column: use value.var to override.

The size of the dots shows what proportion the different ages have within the east and the rest, while color indicates how big the difference between the regions is. The East has relatively fewer people around 20, and more people 60+. The difference in small kids is relatively low, though, and I should not forget that I am comparing incidences as cases per 100,000 already.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I saw that there are seasonal patterns in the incidences of some diseases. The incidence is also influenced by the age of the patient. One region, East, has higher incidences in general.

In a side exploration with incidence from another source it became clear that the aspect of reporting behavior is important, but beyond the scope of the data set.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The age structures of the East and the rest of Germany differ. I hope the next section explains the consequences.

What was the strongest relationship you found?

I think the time of the year has the strongest influence on incidence, but I am not sure if I can quantify that.

Multivariate Plots Section

The plot shows the relative difference of incidences (in percent, size of dots) for the diseases and various ages. There is one dot in the color of the region with the highest deviation from the mean for every year from 2001 to 2015.

For rotavirus, there is a rather big deviation for all ages and years in region East. For almost all diseases, region East deviates most for people under 18. Historically, the healthcare system in the East was much more centralised, which might still influence the tendency of doctors to report cases. There are more patterns in the plot for which I do not have a potential explanation.

Now that the exploration of age, region and disease is exhausted, I have another look at the seasonality.

The seasonality, which showed in the plot of the annual means above, is still there, but there are some changes over time. When I forecast noro- or rotavirus later, I will limit the data to years from 2008:

Norovirus has high incidences in small kids and older people. Maybe the seasonality varies with age for norovirus.

What is the pattern for campylobacter?

While the seasonal pattern is similar within the age groups (columns), it looks different between the groups for norovirus. I will consider that in the final version of the seasonality plot.

Before I try a forecast, I have a look at the seasonality of some diseases by ages and regions:

There is no real difference between the regions. The diseases seem to hit all ages at once within the season.

With the new insights I try to forecast some incidences again:

## Using incidence as value column: use value.var to override.

The thick black line and the thin red line show the actual incidence of a disease and age group. The dashed line is the forecast with a 50% confidence interval. The thin black line shows the residuals of the model.

I would not have expected the forecasts to be that good.

I think I have enough insight now to prepare the final plots.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this part I could confirm and refine the insights from previous sections. Season of the year and age are the major factors when predicting the disease of a patient with gastrointestinal symptoms. I could not find an explanation for the generally higher incidences in the East.

Were there any interesting or surprising interactions between features?

I could not find a relationship between age and season or region and season for the diseases. I decided to limit further exploration to years from 2008 on, because of a change in incidences I had not noticed before.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I forecast incidences for some diseases and age groups, and the forecasts were surprisingly accurate. Nevertheless, they are based on the assumption of continuity, which limits the ability to foresee unusual events.


Final Plots and Summary

Plot One

Description One

The plot shows the mean annual incidence of the top five diseases by age of the patient. The gastro-intestinal symptoms of a middle-aged patient are more likely to be caused by campylobacter than rotavirus.

Plot Two

Description Two

These three plots show the incidence of four diseases during the course of the year for different age groups. The symptoms of a middle-aged patient are more likely to be caused by norovirus in winter and campylobacter in summer.

Plot Three

Description Three

The two plots above show forecasts for the incidence of norovirus for two age groups. The actual incidence (red line) stays mostly within a 50%-confidence interval (grey ribbon) of the forecast.
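How often the actual values fall inside the interval can also be counted. The sketch below is in base R with invented numbers; the data frame `forecast_df` and its columns are hypothetical and do not come from the forecasts shown here.

```r
# Hypothetical forecast bounds of a 50% interval and the observed values.
forecast_df <- data.frame(
  lower  = c(1.0, 1.2, 1.5, 2.0),
  upper  = c(2.0, 2.4, 2.9, 3.6),
  actual = c(1.4, 2.6, 2.0, 3.0)
)

# Flag the observations that lie inside the interval.
inside <- forecast_df$actual >= forecast_df$lower &
  forecast_df$actual <= forecast_df$upper

# Share of observations inside the interval.
coverage <- mean(inside)
coverage
```

For a well-calibrated 50% interval the coverage should be close to 0.5 over many forecasts, so a much higher share suggests that the interval is wider than necessary.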


Reflection

I had to start over twice before I could choose a good selection of data from the source. When I analyzed single variables, age and season of the year emerged as major factors.

I could not explain higher incidences in region East when I explored the relationships between two variables. Further, when contrasting with data from another source it became clear that the way the data is gathered by the Robert Koch Institute limits how well it describes reality.

My multivariate analysis confirmed and refined earlier findings. I identified changepoints in the data which helped to limit my data to relevant years. I could not find a relationship between age and season or region and season, but I was able to forecast some diseases rather accurately.

The libraries which I found most useful for analyzing time series are lubridate, forecast, changepoint, and ggfortify. If I were able to access the data source without having to download separate sheets manually, I could try to get insights from finer geographical information.